Harvestman: a framework for hierarchical feature learning and selection from whole genome sequencing data
Abstract
Background: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present Harvestman, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.
Results: We demonstrate that Harvestman scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that Harvestman selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare Harvestman to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier.
Conclusion: Harvestman is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, Harvestman automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, Harvestman is faster and selects features more parsimoniously.
Journal
Journal title: BMC Bioinformatics
Year: 2021
ISSN: 1471-2105
DOI: https://doi.org/10.1186/s12859-021-04096-6